2025-05-01-15-00
Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models
Abstract
arXiv:2504.21277v1 Announce Type: new Abstract: The integration of reinforcement learning (RL) into the reasoning capabilities of Multimodal Large Language Models (MLLMs) has rapidly emerged as a transformative research direction. While MLLMs significantly extend Large Language Models (LLMs) to handle diverse modalities such as vision, audio, and video, enabling robust reasoning across multimodal inputs remains a major challenge. This survey systematically reviews recent advances in RL-based reasoning for MLLMs, covering key algorithmic designs, reward mechanism innovations, and practical applications. We highlight two main RL paradigms--value-free and value-based methods--and analyze how RL enhances reasoning abilities by optimizing reasoning trajectories and aligning multimodal information. Furthermore, we provide an extensive overview of benchmark datasets, evaluation protocols, and existing limitations, and propose future research directions to address current bottlenecks such as sparse rewards, inefficient cross-modal reasoning, and real-world deployment constraints. Our goal is to offer a comprehensive and structured guide to researchers interested in advancing RL-based reasoning in the multimodal era.
摘要
将强化学习(RL)融入多模态大语言模型(MLLMs)的推理能力,已迅速成为一个变革性的研究方向。尽管MLLMs显著扩展了大语言模型(LLMs)处理视觉、音频和视频等多种模态的能力,但实现跨模态输入的稳健推理仍面临重大挑战。本文系统综述了基于RL的MLLMs推理的最新进展,涵盖关键算法设计、奖励机制创新及实际应用。我们重点分析了两种主要RL范式——无价值函数与基于价值函数的方法,并阐释了RL如何通过优化推理轨迹和对齐多模态信息来增强推理能力。此外,我们全面梳理了基准数据集、评估协议及现有局限性,并针对稀疏奖励、低效跨模态推理和现实部署约束等当前瓶颈问题,提出了未来研究方向。本研究旨在为推进多模态时代基于RL的推理研究提供全面而结构化的指南。
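The value-free versus value-based distinction named in the abstract above can be illustrated with a minimal sketch. The abstract does not say which algorithms the survey covers, so both functions below are generic illustrations rather than the survey's own formulations: the value-free variant normalizes rewards within a group of sampled responses, and the value-based variant subtracts a learned critic's estimate. The function names and the `eps` constant are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def value_free_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Advantage of each sampled response relative to its own group.

    No value network is needed: the group mean serves as the baseline.
    (Illustrative only; not taken from the survey.)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def value_based_advantages(rewards: Sequence[float],
                           value_estimates: Sequence[float]) -> List[float]:
    """Advantage as reward minus a critic's value estimate.

    `value_estimates` stands in for the output of a learned value model.
    """
    return [r - v for r, v in zip(rewards, value_estimates)]


if __name__ == "__main__":
    group_rewards = [1.0, 0.0, 1.0, 0.0]  # e.g. correct / incorrect answers
    print(value_free_advantages(group_rewards))
    print(value_based_advantages([1.0, 0.0], [0.4, 0.1]))
```

The practical difference is where the baseline comes from: the first approach needs several samples per prompt, the second needs a trained critic.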
On the Potential of Large Language Models to Solve Semantics-Aware Process Mining Tasks
Abstract
arXiv:2504.21074v1 Announce Type: new Abstract: Large language models (LLMs) have been shown to be valuable tools for tackling process mining tasks. Existing studies report on their capability to support various data-driven process analyses and even, to some extent, on their ability to reason about how processes work. This reasoning ability suggests that there is potential for LLMs to tackle semantics-aware process mining tasks, which are tasks that rely on an understanding of the meaning of activities and their relationships. Examples include process discovery, where the meaning of activities can indicate their dependency, and anomaly detection, where the meaning can be used to recognize abnormal process behavior. In this paper, we systematically explore the capabilities of LLMs for such tasks. Unlike prior work, which largely evaluates LLMs in their default state, we investigate their utility through both in-context learning and supervised fine-tuning. Concretely, we define five process mining tasks requiring semantic understanding and provide extensive benchmarking datasets for evaluation. Our experiments reveal that while LLMs struggle with challenging process mining tasks when used out of the box or with minimal in-context examples, they achieve strong performance when fine-tuned for these tasks across a broad range of process types and industries.
摘要
大型语言模型(LLMs)已被证明是解决流程挖掘任务的有力工具。现有研究证实其能够支持多种数据驱动的流程分析,甚至在一定程度上具备对流程运作原理的推理能力。这种推理能力表明LLMs具备处理语义感知流程挖掘任务的潜力,这类任务依赖于对活动含义及其关系的理解。例如在流程发现中,活动含义可指示其依赖关系;而在异常检测中,语义信息可用于识别异常流程行为。本文系统性地探索了LLMs在此类任务中的能力。与主要评估默认状态下LLMs的先前研究不同,我们通过上下文学习和监督微调两种方式考察其实用性。具体而言,我们定义了五项需要语义理解的流程挖掘任务,并提供大量基准数据集用于评估。实验表明:虽然LLMs在直接使用或仅提供少量上下文示例时难以应对具有挑战性的流程挖掘任务,但经过针对不同流程类型和行业的任务微调后,其表现显著提升。
Theoretical Foundations for Semantic Cognition in Artificial Intelligence
Abstract
arXiv:2504.21218v1 Announce Type: new Abstract: This monograph presents a modular cognitive architecture for artificial intelligence grounded in the formal modeling of belief as structured semantic state. Belief states are defined as dynamic ensembles of linguistic expressions embedded within a navigable manifold, where operators enable assimilation, abstraction, nullification, memory, and introspection. Drawing from philosophy, cognitive science, and neuroscience, we develop a layered framework that enables self-regulating epistemic agents capable of reflective, goal-directed thought. At the core of this framework is the epistemic vacuum: a class of semantically inert cognitive states that serves as the conceptual origin of belief space. From this foundation, the Null Tower arises as a generative structure recursively built through internal representational capacities. The theoretical constructs are designed to be implementable in both symbolic and neural systems, including large language models, hybrid agents, and adaptive memory architectures. This work offers a foundational substrate for constructing agents that reason, remember, and regulate their beliefs in structured, interpretable ways.
摘要
本专著提出了一种基于信念作为结构化语义状态形式化建模的模块化人工智能认知架构。信念状态被定义为嵌入可导航流形中的语言表达动态集合,其中操作符支持同化、抽象、消解、记忆和内省等功能。通过整合哲学、认知科学与神经科学的研究成果,我们构建了一个分层框架,使具备自我调节能力的认知主体能够进行反思性和目标导向的思维。该框架的核心是认知真空——一类语义惰性的认知状态,作为信念空间的概念起源。在此基础上,零塔结构通过内部表征能力的递归构建而生成。这些理论构造设计适用于符号系统和神经系统实现,包括大语言模型、混合智能体和自适应记忆架构。本研究为构建具有结构化、可解释性的推理、记忆和信念调节能力的智能体提供了基础性框架。
Birdie: Natural Language-Driven Table Discovery Using Differentiable Search Index
Abstract
arXiv:2504.21282v1 Announce Type: new Abstract: Natural language (NL)-driven table discovery identifies relevant tables from large table repositories based on NL queries. While current deep-learning-based methods using the traditional dense vector search pipeline, i.e., representation-index-search, achieve remarkable accuracy, they face several limitations that impede further performance improvements: (i) the errors accumulated during the table representation and indexing phases affect the subsequent search accuracy; and (ii) insufficient query-table interaction hinders effective semantic alignment, impeding accuracy improvements. In this paper, we propose a novel framework Birdie, using a differentiable search index. It unifies the indexing and search into a single encoder-decoder language model, thus getting rid of error accumulations. Birdie first assigns each table a prefix-aware identifier and leverages a large language model-based query generator to create synthetic queries for each table. It then encodes the mapping between synthetic queries/tables and their corresponding table identifiers into the parameters of an encoder-decoder language model, enabling deep query-table interactions. During search, the trained model directly generates table identifiers for a given query. To accommodate the continual indexing of dynamic tables, we introduce an index update strategy via parameter isolation, which mitigates the issue of catastrophic forgetting. Extensive experiments demonstrate that Birdie outperforms state-of-the-art dense methods by 16.8% in accuracy, and reduces forgetting by over 90% compared to other continual learning approaches.
摘要
基于自然语言(NL)驱动的表格发现技术通过NL查询从大规模表格库中识别相关表格。尽管当前基于深度学习的传统稠密向量检索流程(即表示-索引-搜索)方法取得了显著精度,但仍存在限制性能进一步提升的若干问题:(i)表格表示和索引阶段积累的误差会影响后续搜索精度;(ii)查询-表格交互不足阻碍了有效的语义对齐,制约精度提升。本文提出新型框架Birdie,采用可微分搜索索引技术,将索引与搜索统一整合至单个编码器-解码器语言模型中,从而消除误差累积。Birdie首先为每个表格分配前缀感知标识符,并利用基于大语言模型的查询生成器为每个表格创建合成查询;随后将合成查询/表格与其对应表格标识符的映射关系编码至编码器-解码器语言模型的参数中,实现深度查询-表格交互。搜索阶段,训练后的模型直接为给定查询生成表格标识符。为适应动态表格的持续索引需求,我们通过参数隔离引入索引更新策略,显著缓解灾难性遗忘问题。大量实验表明,Birdie在准确率上超越最先进稠密方法16.8%,相比其他持续学习方法减少90%以上的遗忘率。
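The Birdie pipeline described above (identifier assignment, synthetic queries, identifier generation at search time) can be sketched at the data level. This is a toy illustration under stated assumptions, not the authors' implementation: identifiers are modeled as short integer tuples in which related tables share a prefix, which is only one reading of "prefix-aware", and the trie that restricts generation to valid identifiers is a common device in differentiable search indexes that the abstract does not confirm Birdie uses. All names are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Identifier = Tuple[int, ...]


def build_training_pairs(table_ids: Dict[str, Identifier],
                         synthetic_queries: Dict[str, List[str]]
                         ) -> List[Tuple[str, Identifier]]:
    """Pair every synthetic query with the identifier of its source table.

    An encoder-decoder model would be trained to map query -> identifier.
    """
    pairs = []
    for table, ident in table_ids.items():
        for query in synthetic_queries.get(table, []):
            pairs.append((query, ident))
    return pairs


def build_trie(identifiers: Iterable[Identifier]) -> dict:
    """Nested-dict trie over identifier tokens."""
    root: dict = {}
    for ident in identifiers:
        node = root
        for tok in ident:
            node = node.setdefault(tok, {})
    return root


def allowed_next_tokens(trie: dict, prefix: Sequence[int]) -> List[int]:
    """Tokens that keep a partially generated identifier valid."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []  # prefix does not match any known table
        node = node[tok]
    return sorted(node)


if __name__ == "__main__":
    ids = {"sales_2023": (0, 1), "sales_2024": (0, 2), "weather": (1, 0)}
    queries = {"sales_2023": ["revenue by region in 2023"],
               "weather": ["daily rainfall", "temperature by city"]}
    print(build_training_pairs(ids, queries))
    print(allowed_next_tokens(build_trie(ids.values()), (0,)))
```

In this sketch the two sales tables share the first identifier token, so a decoder that has emitted that token can only continue toward one of them.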
Phi-4-reasoning Technical Report
Abstract
arXiv:2504.21318v1 Announce Type: new Abstract: We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on a carefully curated set of "teachable" prompts (selected for the right level of complexity and diversity) and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as the DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of the full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models, and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.
摘要
我们推出Phi-4-reasoning——一个140亿参数的推理模型,该模型在复杂推理任务中表现出色。该模型通过对Phi-4进行监督微调训练而成,训练数据包括精心筛选的具有适当复杂度与多样性的"可教学"提示集,以及使用o3-mini生成的推理演示。Phi-4-reasoning能生成充分利用推理时计算资源的详细推理链。我们还开发了增强版Phi-4-reasoning-plus,该变体通过短期基于结果的强化学习进一步提升了性能,可生成更长的推理轨迹。在各类推理任务中,这两个模型的性能显著优于DeepSeek-R1-Distill-Llama-70B等更大规模的开源权重模型,并接近完整版DeepSeek-R1模型的水平。我们的综合评估涵盖数学与科学推理、编程、算法问题求解、规划及空间理解等基准测试。值得注意的是,我们还观察到模型在通用基准测试上也获得了显著提升。本报告详细阐述了训练数据构成、训练方法及评估过程。研究表明,监督微调(SFT)中精细数据筛选的优势同样适用于推理语言模型,且可通过强化学习(RL)进一步放大。最后,我们的评估指出了当前推理模型性能与鲁棒性评估方法的改进空间。
Galvatron: An Automatic Distributed System for Efficient Foundation Model Training
Abstract
arXiv:2504.21411v1 Announce Type: new Abstract: Galvatron is a distributed system for efficiently training large-scale Foundation Models. It overcomes the complexities of selecting optimal parallelism strategies by automatically identifying the most efficient hybrid strategy, incorporating data, tensor, pipeline, sharded data, and sequence parallelism, along with recomputation. The system's architecture includes a profiler for hardware and model analysis, a search engine for strategy optimization using decision trees and dynamic programming, and a runtime for executing these strategies efficiently. Benchmarking on various clusters demonstrates Galvatron's superior throughput compared to existing frameworks. This open-source system offers user-friendly interfaces and comprehensive documentation, making complex distributed training accessible and efficient. The source code of Galvatron is available at https://github.com/PKU-DAIR/Hetu-Galvatron.
摘要
Galvatron是一个用于高效训练大规模基础模型的分布式系统。该系统通过自动识别最优混合并行策略(包含数据并行、张量并行、流水线并行、分片数据并行、序列并行以及重计算技术),克服了人工选择并行策略的复杂性。系统架构包含三个核心组件:用于硬件与模型分析的性能分析器、基于决策树与动态规划的策略优化搜索引擎,以及高效执行策略的运行时系统。在不同集群上的基准测试表明,Galvatron的吞吐量显著优于现有框架。这一开源系统提供用户友好接口与完整文档,使复杂分布式训练变得高效易用。Galvatron源代码已发布于https://github.com/PKU-DAIR/Hetu-Galvatron。
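The strategy search described above can be pictured with a small dynamic program. This is a simplified sketch, not Galvatron's actual search engine: it assumes identical layers, made-up per-layer costs (`time_ms`, `memory_units`), and no cost for switching strategies between layers, and it omits the decision-tree step that the abstract mentions. The strategy names and numbers are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-layer costs: strategy -> (time_ms, memory_units).
STRATEGIES: Dict[str, Tuple[int, int]] = {
    "data": (10, 8),
    "sharded_data": (12, 5),
    "tensor": (14, 4),
    "data+recompute": (13, 3),
}


def search_strategies(num_layers: int,
                      memory_budget: int,
                      strategies: Dict[str, Tuple[int, int]] = STRATEGIES
                      ) -> Optional[Tuple[float, List[str]]]:
    """Pick one strategy per layer to minimize time within a memory budget.

    State: memory used so far -> (best total time, plan that achieves it).
    Returns None when no assignment fits in the budget.
    """
    best: Dict[int, Tuple[float, List[str]]] = {0: (0.0, [])}
    for _ in range(num_layers):
        nxt: Dict[int, Tuple[float, List[str]]] = {}
        for used, (time_so_far, plan) in best.items():
            for name, (dt, dm) in strategies.items():
                mem = used + dm
                if mem > memory_budget:
                    continue  # this choice would exceed the budget
                cand = (time_so_far + dt, plan + [name])
                if mem not in nxt or cand[0] < nxt[mem][0]:
                    nxt[mem] = cand
        best = nxt
    if not best:
        return None
    return min(best.values(), key=lambda entry: entry[0])


if __name__ == "__main__":
    print(search_strategies(num_layers=4, memory_budget=32))
    print(search_strategies(num_layers=4, memory_budget=20))
```

With a generous budget the search keeps the fastest strategy everywhere; as the budget shrinks it trades time for memory on some layers, which is the kind of hybrid plan the abstract refers to.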
ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning
Abstract
arXiv:2504.21370v1 Announce Type: new Abstract: Reasoning models such as OpenAI o3 and DeepSeek-R1 have demonstrated strong performance on reasoning-intensive tasks through extended Chain-of-Thought (CoT) prompting. While longer reasoning traces can facilitate a more thorough exploration of solution paths for complex problems, researchers have observed that these models often "overthink", leading to inefficient inference. In this paper, we introduce ShorterBetter, a simple yet effective reinforcement learning method that enables reasoning language models to discover their own optimal CoT lengths without human intervention. By sampling multiple outputs per problem and defining the Sample Optimal Length (SOL) as the length of the shortest correct response among all the outputs, our method dynamically guides the model toward optimal inference lengths. Applied to the DeepSeek-Distill-Qwen-1.5B model, ShorterBetter achieves up to an 80% reduction in output length on both in-domain and out-of-domain reasoning tasks while maintaining accuracy. Our analysis shows that overly long reasoning traces often reflect loss of reasoning direction, and thus suggests that the extended CoT produced by reasoning models is highly compressible.
摘要
OpenAI o3和DeepSeek-R1等推理模型通过扩展的思维链(CoT)提示,在推理密集型任务中展现出强劲性能。虽然更长的推理轨迹有助于对复杂问题进行更彻底的求解路径探索,但研究者发现这些模型常出现"过度思考"现象,导致推理效率低下。本文提出ShorterBetter方法——一种简单而有效的强化学习策略,能使推理语言模型无需人工干预即可自主发现其最优CoT长度。该方法通过为每个问题采样多个输出,并将样本最优长度(SOL)定义为所有输出中最短的正确响应,动态引导模型趋向最优推理长度。在DeepSeek-Distill-Qwen-1.5B模型上的应用表明,ShorterBetter在保持准确率的同时,对领域内和领域外推理任务均实现了最高80%的输出长度缩减。分析显示,过长的推理轨迹往往反映推理方向的迷失,这表明推理模型生成的扩展CoT具有高度可压缩性。
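The Sample Optimal Length (SOL) defined in the ShorterBetter abstract above can be computed in a few lines. The `sample_optimal_length` function follows the abstract's definition directly. The reward function is only a guess at how SOL might steer training: the abstract does not give the reward formula, so the `alpha` and `beta` weights, the absolute-difference penalty, and the fallback when no sample is correct are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Sample:
    text: str
    num_tokens: int
    is_correct: bool


def sample_optimal_length(samples: Sequence[Sample]) -> Optional[int]:
    """Length of the shortest correct response among the sampled outputs.

    Returns None when no sample is correct (a case the abstract does not cover).
    """
    correct_lengths = [s.num_tokens for s in samples if s.is_correct]
    return min(correct_lengths) if correct_lengths else None


def length_aware_reward(sample: Sample, sol: Optional[int],
                        alpha: float = 1.0, beta: float = 0.001) -> float:
    """Hypothetical reward: correctness bonus minus distance from SOL."""
    correctness = alpha if sample.is_correct else 0.0
    if sol is None:
        return correctness  # assumed fallback: no length signal available
    return correctness - beta * abs(sample.num_tokens - sol)


if __name__ == "__main__":
    group = [
        Sample("long but right", 200, True),
        Sample("short but wrong", 80, False),
        Sample("short and right", 120, True),
    ]
    sol = sample_optimal_length(group)
    print(sol, [round(length_aware_reward(s, sol), 3) for s in group])
```

Note that the shortest sample overall (80 tokens) does not set the SOL because it is incorrect; only correct responses are eligible.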
MF-LLM: Simulating Collective Decision Dynamics via a Mean-Field Large Language Model Framework
Abstract
arXiv:2504.21582v1 Announce Type: new Abstract: Simulating collective decision-making involves more than aggregating individual behaviors; it arises from dynamic interactions among individuals. While large language models (LLMs) show promise for social simulation, existing approaches often exhibit deviations from real-world data. To address this gap, we propose the Mean-Field LLM (MF-LLM) framework, which explicitly models the feedback loop between micro-level decisions and macro-level population. MF-LLM alternates between two models: a policy model that generates individual actions based on personal states and group-level information, and a mean field model that updates the population distribution from the latest individual decisions. Together, they produce rollouts that simulate the evolving trajectories of collective decision-making. To better match real-world data, we introduce IB-Tune, a fine-tuning method for LLMs grounded in the information bottleneck principle, which maximizes the relevance of population distributions to future actions while minimizing redundancy with historical data. We evaluate MF-LLM on a real-world social dataset, where it reduces KL divergence to human population distributions by 47 percent over non-mean-field baselines, and enables accurate trend forecasting and intervention planning. It generalizes across seven domains and four LLM backbones, providing a scalable foundation for high-fidelity social simulation.
摘要
模拟集体决策不仅涉及个体行为的聚合,更源于个体间的动态交互。尽管大语言模型(LLMs)在社会模拟中展现出潜力,现有方法常与现实数据存在偏差。为弥合这一差距,我们提出平均场大语言模型(MF-LLM)框架,该框架显式建模微观决策与宏观群体间的反馈循环。MF-LLM交替运行两个模型:基于个体状态和群体信息生成个人行为的策略模型,以及根据最新个体决策更新群体分布的平均场模型。二者协同产生模拟集体决策演化轨迹的推演结果。为更好匹配现实数据,我们提出基于信息瓶颈原理的微调方法IB-Tune,其在最大化群体分布与未来行动相关性的同时,最小化与历史数据的冗余。我们在真实社会数据集上评估MF-LLM,其相较于非平均场基线方法将人类群体分布的KL散度降低47%,并能实现精准趋势预测与干预规划。该框架在七个领域和四种LLM骨干模型中均展现泛化能力,为高保真社会模拟提供了可扩展的基础。
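The alternation between the policy model and the mean field model described above can be sketched as a rollout loop. Both models are LLMs in the paper; here they are replaced by simple stand-ins so the loop is runnable: the policy samples an action from a mix of a personal bias and the current population distribution, and the mean-field update is just the empirical action frequency. The action set, the 50/50 mixing weight, and all function names are assumptions. The KL helper shows the kind of distribution-level comparison the evaluation refers to; the direction of the divergence is also an assumption.

```python
import math
import random
from collections import Counter
from typing import Dict, List

ACTIONS = ["support", "oppose", "neutral"]  # hypothetical action space


def policy_model(personal_bias: Dict[str, float],
                 population: Dict[str, float],
                 rng: random.Random) -> str:
    """Stand-in for the LLM policy: personal state mixed with group signal."""
    weights = [0.5 * personal_bias[a] + 0.5 * population[a] for a in ACTIONS]
    return rng.choices(ACTIONS, weights=weights, k=1)[0]


def mean_field_model(actions: List[str]) -> Dict[str, float]:
    """Stand-in for the mean field model: empirical action frequencies."""
    counts = Counter(actions)
    return {a: counts[a] / len(actions) for a in ACTIONS}


def rollout(biases: List[Dict[str, float]],
            init_population: Dict[str, float],
            steps: int, seed: int = 0) -> List[Dict[str, float]]:
    """Alternate individual decisions and population updates."""
    rng = random.Random(seed)
    population = dict(init_population)
    trajectory = [population]
    for _ in range(steps):
        actions = [policy_model(b, population, rng) for b in biases]
        population = mean_field_model(actions)
        trajectory.append(population)
    return trajectory


def kl_divergence(p: Dict[str, float], q: Dict[str, float],
                  eps: float = 1e-9) -> float:
    """KL(p || q) over the action set, smoothed to avoid log(0)."""
    return sum(p[a] * math.log((p[a] + eps) / (q[a] + eps))
               for a in ACTIONS if p[a] > 0)


if __name__ == "__main__":
    uniform = {a: 1 / 3 for a in ACTIONS}
    agents = [{"support": 0.6, "oppose": 0.3, "neutral": 0.1}] * 50
    traj = rollout(agents, uniform, steps=5)
    print(traj[-1], kl_divergence(traj[-1], uniform))
```

Each step feeds the updated population distribution back into the next round of individual decisions, which is the micro-to-macro feedback loop the abstract emphasizes.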